Rijkstrainees - Statistical Programming in R
Everything for today (and more) can be found at
Pipes
Data manipulation
Basic analysis (correlation & t-test)
Data visualization with ggplot2
Two learning curves: Practicals with and without answers
library(dplyr) # data manipulation library(magrittr) # pipes library(mice) # for the boys data library(ggplot2) # visualization library(DT) # fancy JS/html tables library(reshape2) # melt stuff
head(boys)
## age hgt wgt bmi hc gen phb tv reg ## 3 0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south ## 4 0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south ## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south ## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south ## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south ## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
At the end of this lecture we aim to understand what happens in
ggplot(mutate(na.omit(select(boys, age, hgt, reg)), height_meters = hgt/100),
aes(height_meters, age)) + geom_point(aes(group = reg))
boys %>% select(is.numeric) %>% cor(use = "pairwise.complete.obs") %>% round(3)
It effectively replaces round(cor(select(boys, is.numeric), use = "pairwise.complete.obs"), digits = 3).
Benefit: a single object in memory that is easy to interpret Your code becomes more readable:
f(x) becomes x %>% f()boys$age %>% mean()
## [1] 9.158866
f(x, y) becomes x %>% f(y)boys %>% head(n = 1)
## age hgt wgt bmi hc gen phb tv reg ## 3 0.035 50.1 3.65 14.54 33.7 <NA> <NA> NA south
h(g(f(x))) becomes x %>% f %>% g %>% hboys %>% select(is.numeric) %>% na.omit() %>% colMeans
## age hgt wgt bmi hc tv ## 14.07467 165.42411 54.01027 19.08348 55.28884 11.89732
%>% pipe%$% pipe. in a pipeIn a %>% b(arg1, arg2, arg3), a will become arg1. With . we can change this.
boys %>% plot(bmi ~ age, data = .)
VS
boys %$% plot(bmi ~ age)
The . can be used as a placeholder in the pipe.
boys %>% mutate(ovwgt = bmi > 25) %$% t.test(age ~ ovwgt)
## ## Welch Two Sample t-test ## ## data: age by ovwgt ## t = -15.971, df = 32.993, p-value < 2.2e-16 ## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0 ## 95 percent confidence interval: ## -9.393179 -7.270438 ## sample estimates: ## mean in group FALSE mean in group TRUE ## 9.103392 17.435200
is the same as
t.test(age ~ (bmi > 25), data = boys)
boys %>% select(reg, age) %>% melt(id.vars = "reg", variable.name = "variable", value.name = "value") %>% datatable(options = list(pageLength = 25, scrollY = "250px"))
boys %>% select(reg, age) %>% melt(id.vars = "reg", variable.name = "variable", value.name = "value") %>% group_by(variable, reg) %>% summarise_all(list(mean = mean, sd = sd), na.rm = TRUE) %>% datatable(options = list(pageLength = 25, scrollY = "200px"))
boys %>% select(reg, where(is.numeric)) %>% melt(id.vars = "reg", variable.name = "variable", value.name = "value") %>% group_by(variable, reg) %>% summarise_all(list(mean = mean, sd = sd), na.rm = TRUE) %>% datatable(options = list(pageLength = 25, scrollY = "200px"))
boys %>% mutate(bmi_calc = wgt / (hgt/100)^2) %>% select(bmi, bmi_calc) %>% head()
## bmi bmi_calc ## 3 14.54 14.54177 ## 4 11.77 11.77395 ## 18 12.56 12.56000 ## 23 14.37 14.37589 ## 28 15.21 15.21361 ## 36 15.11 15.11241
boys %>%
mutate(reg = NULL,
gen = NULL,
phb = NULL) %>%
head()
## age hgt wgt bmi hc tv ## 3 0.035 50.1 3.650 14.54 33.7 NA ## 4 0.038 53.5 3.370 11.77 35.0 NA ## 18 0.057 50.0 3.140 12.56 35.2 NA ## 23 0.060 54.5 4.270 14.37 36.7 NA ## 28 0.062 57.5 5.030 15.21 37.3 NA ## 36 0.068 55.5 4.655 15.11 37.0 NA
boys %>% mutate(hgt = hgt/100) %>% tail()
## age hgt wgt bmi hc gen phb tv reg ## 7410 20.372 1.887 59.8 16.79 55.2 <NA> <NA> NA west ## 7418 20.429 1.811 67.2 20.48 56.6 <NA> <NA> NA north ## 7444 20.761 1.891 88.0 24.60 NA <NA> <NA> NA west ## 7447 20.780 1.935 75.4 20.13 NA <NA> <NA> NA west ## 7451 20.813 1.890 78.0 21.83 59.9 <NA> <NA> NA north ## 7475 21.177 1.818 76.5 23.14 NA <NA> <NA> NA east
boys %$% table(reg)
## reg ## north east west south city ## 81 161 239 191 73
boys %>% select(hgt, reg) %>% mutate(across(!hgt, as.numeric)) %$% table(reg)
## reg ## 1 2 3 4 5 ## 81 161 239 191 73
ggplot2anscombe dataanscombe
## x1 x2 x3 x4 y1 y2 y3 y4 ## 1 10 10 10 8 8.04 9.14 7.46 6.58 ## 2 8 8 8 8 6.95 8.14 6.77 5.76 ## 3 13 13 13 8 7.58 8.74 12.74 7.71 ## 4 9 9 9 8 8.81 8.77 7.11 8.84 ## 5 11 11 11 8 8.33 9.26 7.81 8.47 ## 6 14 14 14 8 9.96 8.10 8.84 7.04 ## 7 6 6 6 8 7.24 6.13 6.08 5.25 ## 8 4 4 4 19 4.26 3.10 5.39 12.50 ## 9 12 12 12 8 10.84 9.13 8.15 5.56 ## 10 7 7 7 8 4.82 7.26 6.42 7.91 ## 11 5 5 5 8 5.68 4.74 5.73 6.89
anscombe %>% colMeans()
## x1 x2 x3 x4 y1 y2 y3 y4 ## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
anscombe %>% cor() %>% round(digits = 3) %>% .[1:4, 5:8]
## y1 y2 y3 y4 ## x1 0.816 0.816 0.816 -0.314 ## x2 0.816 0.816 0.816 -0.314 ## x3 0.816 0.816 0.816 -0.314 ## x4 -0.529 -0.718 -0.345 0.817
anscombe %>% var() %>% round(digits = 3) %>% .[1:4, 5:8]
## y1 y2 y3 y4 ## x1 5.501 5.500 5.497 -2.115 ## x2 5.501 5.500 5.497 -2.115 ## x3 5.501 5.500 5.497 -2.115 ## x4 -3.565 -4.841 -2.321 5.499
anscombe %>% ggplot(aes(y1, x1)) + geom_point() + geom_smooth(method = "lm")
Source: Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.
ggplot2?Layered plotting based on the book The Grammer of Graphics by Leland Wilkinsons.
With ggplot2 you
ggplot2 then takes care of the details
1: Provide the data
mice::boys %>% ggplot()
2: map variable to aesthetics
mice::boys %>% ggplot(aes(x = age, y = bmi))
3: state which geometric object to display
mice::boys %>% ggplot(aes(x = age, y = bmi)) + geom_point()
Create the plot
gg <- mice::boys %>% ggplot(aes(x = age, y = bmi)) + geom_point(col = "dark green")
Add another layer (smooth fit line)
gg <- gg + geom_smooth(col = "dark blue")
Give it some labels and a nice look
gg <- gg + labs(x = "Age", y = "BMI", title = "BMI trend for boys") + theme_minimal()
plot(gg)
ggplot(mutate(na.omit(select(boys, age, hgt, reg)), height_meters = hgt/100),
aes(height_meters, age)) + geom_point(aes(group = reg))
Is the same as
boys %>% select(age, hgt, reg) %>% # select features na.omit() %>% # remove missings. NAUGHTY! mutate(height_meters = hgt/100) %>% # transform height ggplot(aes(x = height_meters, y = age)) + # define plot aes geom_point(aes(group = reg)) # add geom